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ABSTRACT 

Motivation: Copy number variations (CNVs) are a major source of 
genomic variability and are especially significant in cancer. Until re- 
cently microarray technologies have been used to characterize CNVs 
in genomes. However, advances in next-generation sequencing tech- 
nology offer significant opportunities to deduce copy number directly 
from genome sequencing data. Unfortunately cancer genomes differ 
from normal genomes in several aspects that make them far less 
amenable to copy number detection. For example, cancer genomes 
are often aneuploid and an admixture of diploid/non-tumor cell frac- 
tions. Also patient-derived xenograft models can be laden with mouse 
contamination that strongly affects accurate assignment of copy 
number. Hence, there is a need to develop analytical tools that can 
take into account cancer-specific parameters for detecting CNVs 
directly from genome sequencing data. 

Results: We have developed WaveCNV, a software package to iden- 
tify copy number alterations by detecting breakpoints of CNVs using 
translation-invariant discrete wavelet transforms and assign digitized 
copy numbers to each event using next-generation sequencing data. 
We also assign alleles specifying the chromosomal ratio following 
duplication/loss. We verified copy number calls using both microarray 
(correlation coefficient 0.97) and quantitative polymerase chain 
reaction (correlation coefficient 0.94) and found them to be highly 
concordant. We demonstrate its utility in pancreatic primary and 
xenograft sequencing data. 

Availability and implementation: Source code and executables 
are available at https://github.com/WaveCNV. The segmentation 
algorithm is implemented in MATLAB, and copy number assignment 
is implemented Perl. 

Contact: lakshmi.muthuswamy@gmail.com 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

DNA copy number variations (CNVs) are associated with a wide 
range of diseases including cancer where detection of copy 
number alterations has led to guided-therapeutic interventions. 
For example, amplification of the ERBB2 locus is used to iden- 
tify patients for trastuzumab treatment. Although Comparative 
Genome Hybridization (CGH), microarray s have an intrinsic 
kilobase (kb) resolution for CNV detection, the advent of high- 
throughput next-generation sequencing (NGS) technologies 
offers us the potential to probe genomic structural variation at 
base-pair level. However, with the increase in signal resolution 
comes a substantially increased noise signature and the problem 
of how to remove false positives. Recent efforts by various 
groups (Abyzov et aL, 2011; Ivakhno et al., 2010; Kim et al., 
2010; Klambauer et ai, 2012; Magi et al., 2011; Medvedev et al., 
2009; Miller et al., 2011; Waszak et al., 2010; Xie and Tammi, 
2009; Yoon et al., 2009) have attempted to mitigate the noise by 
carrying out a smoothing (binning) of the sequencing read depth 
on scales of tens to hundreds of base pairs and examining this 
smoothed read depth. The smoothing process is performed on a 
set, arbitrary, scale, which can smooth-out physically interesting 
features of a signal. This is of significant concern for cancer 
genomes, which are known to have unstable genomes that con- 
stantly evolve. Smoothing methods also assume that the noise 
signature of the signal is overwhelmingly concentrated on a 
single base-pair genomic scale (high frequency) and ignores the 
possibility of strong long-range (low-frequency), systemic, corre- 
lated noise that may increase the false -positive rate of any detec- 
tion algorithm. Another recent effort, Varbin (Baslan et al., 
2012) uses a variable binning approach to take into account an 
uneven distribution of mappable reads. Although this method is 
suitable for low or sparse coverage as illustrated in single cell 
sequencing (Navin et al., 2011), it does not fully harness the 
available base-pair-scale genomic resolution. 

Assignment of digitized copy number to genomic segments in 
tumors is further comphcated in cancer genomes due to a 
number of sample- specific confounding factors. For example, 
primary tumor tissues may contain low tumor cellularity due 
to an admixture of diploid/non-tumor cell fraction in patient 
samples, including pancreatic cancer where tumor cellularity 
can vary from 5 to 80%, thus making the detection of cancer 
driver mutations difficult (Biankin et al., 2012). In addition to 
primary tumors, patient-derived samples grown in mouse 
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Fig. 1. Flow chart showing the analysis procedure 



xenograft (PDX) models are being increasingly used in pre-clin- 
ical settings to understand tumor biology and therapy response 
(Huynh et aL, 2011; Morton and Houghton, 2007). Assignment 
of digitized copy number to CNVs in these models becomes in- 
creasingly difficult due to mouse contamination of the tumor 
samples that introduces noise in the sequencing coverage as 
well as allele frequencies for SNVs (both of which are integral 
to CNV caUing methods). Although algorithms such as qpure 
(Song et aL, 2012), genoCN (Sun et aL, 2009), ASCAT (Van Loo 
et aL, 2010) and ABSOLUTE (Carter et aL, 2012) model for 
stromal contamination and ploidy estimation on SNP array 
data, they do not function for genome sequencing data. Also 
these methods cannot correct for the additional effects of xeno- 
graft mouse contamination. 

To fill this need, we have developed WaveCNV, a tool that 
uses DNA sequencing data to model for complex cancer 
genomes. The algorithm estimates ploidy, tumor cellularity in 
primary tumors, mouse content in xenograft models and assigns 
digitized copy numbers and alleles to indicate which parental 
chromosome pair was affected by each copy number event. 
Also to overcome limitations associated with binning-based 
approaches, we use the well-established theory of wavelets to 
take full advantage of the genomic resolution available in 
sequencing data. Figure 1 illustrates the overall flowchart of 
data generation and copy number modeling. 



2 METHODS 

2.1 Segmentation algorithm 

Wavelet theory is used both for denoising of the depth of coverage in 
NGS data (which is inherently multiscale and carries non-uniform cover- 
age signal) and to identify rapid transitions corresponding to CNV break- 
points. The wavelet transform (Mallat, 2008) breaks a given signal into 
different frequency components with a resolution matched to its intrinsic 



scale and can thus claim fundamental advantages over traditional Fourier 
methods in detecting sharp localized discontinuities as observed in copy 
number alterations. This specific property of wavelet transform is crucial 
in analyzing signals, specifically NGS coverage data, where size of copy 
number alterations can vary from base pair to length of a chromosomal 
arm. We give a brief description here, while the mathematical details are 
provided in Supplementary Materials S.l and S.2. 

We first select the wavelet basis function by using the inherent nature 
of copy number alterations that a genomic region with a read depth / is 
likely to make digitized step transitions and hence choose the simplest of 
all wavelets, a step function or the Haar wavelet. Given our choice of the 
Haar basis, we use a translation-invariant discrete wavelet transformation 
on the normalized read depth (Coifman and Donoho, 1995) to obtain 
detailed signal frequency and scale information — encapsulated by the 
approximation and detail coefficients. The approximation coefficients 
will contain both the low-frequency component (feature sizes of the 
order of a few kilobases) and a high-frequency component unique to 
sequencing data (feature sizes less than a kilobase). The detail coefficients 
will contain an exclusively high-frequency component, which is more 
likely to have significant noise but also possibly important small-scale 
insertions and deletions. We scan across scales of interest by successively 
iterating the decomposition of signal /, with successive approximation 
coefficients being decomposed in turn. This results in the signal being 
broken down into many lower genomic-resolution components starting 
from a small scale. 

We then use de-noised approximation coefficients to define boundaries 
where there is a transition from one copy number state to another. 
Detection of breakpoints is achieved by asking when the coefficients of 
the maximal scale intersect those of the finest scale as given in Equation 
(1). For reasons of economy and because the CNV distribution is largely 
unknown, we examine the intersections between the approximation coef- 
ficients at entropy scale (<3L*) and the partial autocorrelation scale (aP) 
(Supplementary Material S.2). 



E sgn 



The main point in the approach is that by examining the zero crossings 
of this special function in Equation (1), we should have an extremely low 
false-negative rate owing to the inherent sensitivity of the Haar wavelets 
to abrupt changes in the signal at this wide range of scales. One can show 
that this procedure is equivalent to searching for local maxima of the 
squared modulus of the dominant wavelet coefficients in the signal 
(Legarreta et al., 2005). Figure 2 illustrates clearly that the major features 
of the signal discontinuities are captured by the wavelet transformed 
and de-noised signal, but with subtle differences in that the detail coeffi- 
cients are extremely sensitive to steep-gradient features and miss gradual 
read depth changes that are instead captured by the approximation 
coefficients. 

2.2 Allele-specific copy number estimation 

After the breakpoints are detected using our segmentation algorithm, we 
assign digital copy numbers to each segment. Our basic method is similar 
to copy number models applied to microarray data (Sun et al., 2009; Van 
Loo et al., 2010; Wang et al., 2007) with additional layers of complexity 
added to the model due to tumor-specific confounding factors 
(Supplementary Materials S.5-S.11 and methods below). We use sequen- 
cing coverage modeled as a Poisson distribution and minor allele 
frequency (MAF) modeled as a binomial distribution to assign digitized 
copy numbers to each CNV event. We also assign alleles to each copy 
number event describing the parental chromosome ratio following each 
duplication or loss. For example, a three-copy region might have an allele 
of 1:2 (one copy from the first parental chromosome and two copies from 
the other parental chromosome), whereas allele 0:3 would also be possible 
(three copies of one parental chromosome and complete loss of the 
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Fig. 2. Detection of signal discontinuities using wavelet transformed and de-noised signal over a 16 kb region. Top panel shows the raw read depth (gray) 
and the denoised signal (red). Bottom panel illustrates copy number break points where the coefficient of the maximal scale intersects those of the finest 
scale. The y-axis is the squared approximation wavelet coefficient, and x-axis is the genomic position in megabases 



other). Alleles are assigned based on MAF distribution within the CNV 
event, which will be specific to chromosomal balance (e.g. a 1:2 allele 
would produce MAF distribution peaks at 0.33 for SNVs on one 
chromosome and 0.66 for SNVs on the other chromosome). Allelic 
assignment is possible in cancer because somatic duplication/loss events 
are recent, so linkage among SNVs is not expected to break down as it 
does in germline CNVs. The allele assignments in WaveCNV can be used 
to associate CNVs with SNVs/indels that appear to be preferentially 
gained or lost. 

In addition to modeling for basic coverage and MAF, we also model 
for aneuploidy, normal/diploid contamination of primary tumor samples, 
mouse contamination of human tumors grown in xenograft and we per- 
form auto-correction of systematic sequencing biases using matched 
normal/control samples. For validation purposes, we used WaveCNV 
to identify CNV events in human pancreatic cancer samples. 
Sequencing data were aligned using Novoalign (Novovcraft, Inc.) and 
processed using Genome Analysis Tool Kit to identify SNVs and 
minor allele frequencies (see Fig. 1 and Supplementary Material S.3 for 
data generation and S.4 for data pre-processing). 

2.3 Estimation of minimum detectable CNV length 

Given that coverage is modeled as a Poisson distribution, the variance for 
the median coverage can be approximated after adapting Raikov's the- 
orem using the equation: 

V=cjn^ (2) 

where Cg is the expected segment median coverage and is the number of 
independent data points in the region (See Supplementary Material S.5). 
Variance is thus a function of both coverage and segment length, and a 
relationship can be derived to identify the minimum segment length 
required to identify a copy number event to a specified confidence thresh- 
old (See Supplementary Materials S.5 and S.7). 

The length of all segments must then satisfy the following relationship 
to be detectable: 



where c, is the average expected median coverage on the region of interest, 
d is the difference in coverage from the neighboring segment and o; is a 
selected threshold factor (3.890592 for 0.01%). This relationship specifies 
that events become detectable with either deeper coverage or longer 



segments, and low copy events are more easily distinguished than high 
copy events. Such information is invaluable because it allows us to de- 
termine the minimum sequencing coverage required before even begin- 
ning an experiment. This can be especially useful when sequencing tumor 
samples with diploid/normal fraction contamination that dilutes apparent 
separation between copy number levels. For example, the smallest events 
that could be identified in a primary tumor sample sequenced with 101 
base pair reads and having a cellularity of 0.20 would be ~7 kb in length 
at 30 X coverage and ~2kb at lOOx coverage. We also use this relation- 
ship to simplify our calling algorithm and improve run times by merging 
short segments before calculating fits to each copy number model. 

2.4 Estimation of mouse contamination in xenograft 
models 

Human derived tumors are commonly grown as xenografts in mice to 
facilitate continued study of the tumor's biology or increase total tumor 
content of low cellularity tumor types. When using these xenografted 
samples with NGS, mouse DNA contamination of the human-derived 
tumors can introduce confounding factors into both coverage and MAF, 
which can falsely alter the apparent copy number. The overall effect of 
this contamination becomes more extreme as the mouse content of the 
sample increases. One approach to removing mouse contamination used 
by tools like Xenome (Conway et al., 2012) is to try and directly identify 
non-human sequencing reads and remove them upstream of any data 
processing. However, there are many conserved regions of high sequence 
identity between human and mouse for which sequencing reads cannot be 
separated in this way. Unfortunately these regions of conservation are 
primarily concentrated in gene coding regions (which are of main interest 
in cancer analysis). We thus take another approach that could be used in 
complement with tools like Xenome. We adjust expected coverage higher 
and shift expected MAF values based on the estimated amount of mouse 
contamination in the region (The mouse is assumed to come from an 
inbred line, so it will be diploid and homozygous for most mouse-specific 
SNVs). 

Figure 3A and B show a two copy 30-megabase region observed in 
chromosome 1 of a human pancreatic cancer cell line that is expected to 
be free of mouse DNA contamination. The observed MAF distribution 
for the cell line (Fig. 3B red line) is centered around the MAF value of 0.5 
and closely matches the calculated expected MAF distribution (Fig. 3B 
blue line). For the exact same two copy 30-megabase region from the 
same human tumor sample grown in xenograft, the observed distribution 
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Fig. 3. MAP distribution of SNVs in a 30 Mb region of chrl . (A) MAP 
density in a pancreatic cancer cell line; (B) observed (red) and normal 
fitted expect (blue) distribution curves of MAP for pancreatic cancer cell 
line; (C) MAP density in a pancreatic xenograft model; (D) observed 
(red), normal fitted expect (blue) and expect with mouse contamination 
(green) for pancreatic xenograft model 

of the MAP (Pig. 3D red line) is centered at 0.47, below the expected 
value of 0.5 (Pig. 3D blue line). There is also an observable band of data 
introduced by the mouse contamination around 0.16 (Pig. 3C and D blue 
arrows). Owing to the multi-modality of the MAP distribution, the ob- 
servation deviates significantly from the expected distribution curve. 

In our CNV calling algorithm, we adjust the expected MAP frequen- 
cies to take confounding factors caused by the aligning mouse reads into 
account by adding an independent distribution peak for mouse-derived 
SNVs as well as modeling for the degree that MAP peaks will be shifted 
by mouse reads (mouse-derived SNVs will be two copy homozygous for 
inbred lines). Our improved expected distribution seen in Pigure 3D 
(green line), clearly matches the observed distribution (red line) better 
than the standard expect (blue line). We also alter expected coverage 
for the segments by estimating the quantity of mouse reads that will 
align (these values are fixed into WaveCNV, but can also be supplied 
as a BAM file if mouse was sequenced independently). 

Based on kernel density estimation of mouse-expected coverage, the 
average mouse contamination of this particular xenograft was 21% of the 
total DNA content of the sample. We validated the estimated mouse 
contamination using qPCR. Two target loci were chosen such that 
one of them maps uniquely to human and another to the mouse 
genome. The values from TaqMan® qPCR analysis were used to calcu- 
late the relative absolute quantity between human and mouse probes, 
which demonstrated a 27% mouse contamination in the pancreatic xeno- 
graft compared with the tumor cell line derived from the same tumor. 
Thus these two alternate approaches ascertain the estimation of mouse 
contamination using our model within an acceptable margin of error. 

2.5 Estimation of cellularity in primary tumors 

Normal/diploid cell contamination of primary tumors complicates CNV 
calling by diluting signal from the tumor cells and reducing the amount of 
observed coverage separating copy number levels as well as altering the 



Mixed tumor fraction 


WaveCNV estimate 


0.05 


0.043 


0.10 


0.088 


0.15 


0.155 


0.20 


0.236 


0.40 


0.403 


0.60 


0.602 


1.00 


1.00 



Note: The table shows WaveCNV-derived cellularity estimates for 
a dilution series of diploid/normal contamination mixed into a 
pancreatic cancer cell line model. 



expected minor allele frequencies at each copy number level. Corrections 
for shift in coverage and MAP can be obtained if you know the cellularity 
of a sample. Previously qpure (Song et al., 2012) has attempted to esti- 
mate cellularity using a relationship for the shift in MAP in the single 
outermost peak of loss of heterozygosity (LOH) events. Notably, how- 
ever, they found that the relationship they use does not hold linear for 
values <20% cellularity. We followed an approach similar to theirs by 
using the shift in MAP for LOH events to estimate cellularity; however, 
we make use of LOH events at multiple copy number levels and derived a 
relationship that does hold Hnear even at low cellularity: 



1 



Mloh 



N+2 



(4) 



where T is the tumor cellularity, is the copy number of the region, and 
Mloh is the left-most central MAP peak for the region at copy number 
A^. The slope of the relationship is therefore a function of the cellularity T. 
Also because the y intercept of the relationship is always fixed at 2 
(reciprocal of MAP 0.5), we can fit to the proper copy number for 
complex aneuploidy events. 

Supplementary Pigure S4, panel A clearly shows the outer most MAP 
peaks for copy numbers 1-3 of a patient-derived pancreatic cancer pri- 
mary tumor sample. As shown in Supplementary Pigure S4, panel B, 
when the MAP values from LOH peaks are used with Equation (4), 
the slope allows us to derive the cellularity of the sample. The resulting 
slope 0.611 (7^^ = 0.99876) corresponds to a cellularity of 0.38 for this 
tumor sample. 

We further vaHdated our model using a dilution series of pancreatic 
tumor cells derived from a primary tumor cell line mixed with increasing 
quantities of diploid cells derived from matched normal. Table 1 shows a 
convincing vaHdation of tumor content estimation for these samples ran- 
ging from 5 to 100% cellularity. Estimates match well with expected 
values even for low cellularities, demonstrating the effectiveness of our 
method. 

Identifying genomic mutational landscape has been difficult in tumor 
genomes where tumor content is <20%. However, using sequencing data, 
it may now be possible to use low cellularity tumors to detect mutational 
landscape if coverage is sufficient [overall coverage determines the min- 
imum length of detectable copy number events according to Equation (3) 
earher mentioned in the text]. 

2.6 Estimation of ploidy 

One of the most difficult aspects of assigning digital copy number values 
to a sample is in determining what the expected coverage or copy neutral 
coverage would be. Many algorithms assume that the majority of a 
sample is diploid and any gains and losses are determined based on 
normalizing the coverage of each chromosome using this assumption. 
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This becomes problematic especially for tumor samples where the major- 
ity of the genome is often not expected to be diploid. 

We have developed a procedure for identifying the base coverage cor- 
responding to a one-copy shift that can be used to determine the ploidy of 
a given sample. Because multiples of the optimal value for the base cover- 
age should correlate with the observed coverage for all segments of the 
genome, we perform an iterative search for a value that generates a 
genome-wide maximum coverage likelihood while simultaneously gener- 
ating the best fit to an MAF as measured by residual sum of squares (rss). 
This conveniently happens at the point of maximum separation between 
normalized curves of coverage likelihood and rss. The overall procedure 
for selecting a base coverage is further detailed in the Supplementary 
Material S.IO. 

We validated our procedure using a triploid pancreatic tumor sample 
and its diploid matched control/normal. The expected coverage median 
for the diploid matched normal genome was determined to be 38 using 
Gaussian kernel density estimation (Fig. 4A). A search through the base 
coverage candidate space (Fig. 4B) using normalized coverage likelihood 
(red line) and normalized rss fit for MAF (blue line) reveals that max- 
imum separation (yellow line) occurs at coverage 19.73. Given that the 
kernel-derived genome median coverage is 38, a base coverage of 19.73 



A Kernel Fit of B Coverage Fit for 
Expected Coverage Base Copy Number 




Sequence Coverage Sequence Coverage 



would give a correct ploidy estimate of two for the genome. When the 
same procedure is applied to the triploid tumor sample (Fig. 4C and D) 
the base coverage is calculated to be 9.84 and the expected median cover- 
age is 28, giving a correct ploidy estimate of three for the sample. 

2.7 Matched normal corrects for coverage bias and 
germline events 

Because WaveCNV is a somatic CNV caller, as CNVs are assigned to the 
tumor sample it can simultaneously assign copy numbers to the same 
segment in the diploid matched control (sequenced together with the 
tumor). This allows WaveCNV to determine if given losses, gains and 
LOH events are in fact somatic or germline events. Furthermore, the 
matched normal also allows WaveCNV to correct for anomalies present 
in the reference sequence including systematic variance in coverage, high 
repeat regions, unsequencable regions with consistently missing coverage 
and so forth. Supplementary Figure S5 clearly demonstrates the decrease 
in variance for genomic coverage when a matched control based correc- 
tion is applied (blue line) as opposed to the standard coverage distribu- 
tion (red line). Final somatic calls are highlighted in the output report to 
distinguish them from other copy number calls, thus allowing researchers 
to immediately focus on the events most likely to be important to tumor 
progression. Further details for matched normal/control-based correction 
are found in the Supplementary Materials S.ll. 



3 RESULTS 

We identified 764 somatic copy number aberrations in pancreatic 
cancer genome sequencing data using WaveCNV. The size of 
CNV events varied from 284 bp to 33 Mb. Supplementary 
Table S4 lists these events along with their verification status 
using alternate platforms, and Supplementary Figure S7 illus- 
trates the size distribution of those events. 
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Fig. 4. Modeling for aneuploidy. (A) The expected segment median 
coverage for a diploid genome is estimated using kernel density estima- 
tion. This value then serves to define a range for estimating the sample 
base coverage (coverage of copy number 1). (B) The normalized likeli- 
hood of the observed coverage (red line) as well as the normalized 
residual sum of squares value (rss) for all MAF distribution fits (blue 
line) are calculated for each candidate base coverage (assuming ploidy 
range 1^). The base coverage that produces the maximum separation 
between likelihood and rss (yellow line) is then selected. (C and D) show 
the expected segment median coverage and the base coverage selected for 
a triploid genome 



3.1 Ascertainment of somatic CNVs using microarray and 
qPCR technologies 

CNVs identified using WaveCNV were verified using three alter- 
nate technologies: Nimblegen, 2.1 million CGH tiling array, 
lUumina 1 miUion Omni-quad SNP array and verification of 
80 CNV loci with copy number varying from 0 to 30 using 
qPCR. We find a high correlation between CN estimated from 
qPCR method and WaveCNV as shown in Figure 5 A. We fit 
linear regression and found that regression coefficient (0.94) with 
P<2e-16. With the regression coefficient close to 1, it confirms 
that our CN model used in WaveCNV algorithm is able to pre- 
dict accurately a wide range of copy numbers. We also compared 
CNVs from the whole genome with two different array-based 
platforms, lUumina Omni 1 M quad and Nimblegen 2.1 M 
array CGH. Invariably, most of the array platforms have 
lower dynamic range compared with sequencing that results in 
approximate digitized CN. Hence we compared CN from 
sequencing to the median intensity signal of the probes covering 
the region from array platforms as shown in Figure 5B and C. 
We find a high concordance between array platforms and 
WaveCNV. The weighted Pearson correlation coefficients are 
calculated to be 0.86 for lUumina array and 0.97 for 
Nimblegen array with weights proportional to the length of the 
segment. 
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Fig. 5. Validation of copy number calls using three methods. 
(A) Verification of 80 CNV loci by qPCR on a pancreatic cancer 
genome. Copy numbers from qPCR were estimated based on threshold 
cycle (Ct) values. The Pearson correlation coefficient is 0.94. (B) 
Verification of 473 somatic CNVs on the whole-genome using Illumina 
Human Omni 1 Million microarray. Shown here is the concordance be- 
tween intensity ratios in microarray to WaveCNV CN. The Pearson cor- 
relation coefficient is 0.86. (C) Verification of 468 somatic CNVs on the 
whole genome using Nimblegen 2.1 Million aCGH microarray. 
Shown here is the concordance between aCGH intensities ratio 
in microarray to WaveCNV CN. The Pearson correlation coefficient 
is 0.97 



3.2 Algorithm performance comparison 

We additionally compared our results to the sequencing-based 
CNV caUing algorithms CNVnator (Abyzov et al, 2011) and 
OncoSNP-SEQ (Yau, 2013). We used base pair level congruency 
between algorithm calls to compare matches. We define congru- 
ency to be the average of sensitivity (the fraction of a reference 
feature predicted) and specificity (the fraction of a prediction 
overlapping a reference feature). In all cases the reference is the 
algorithm we are comparing with. 

Table 2 shows the comparative statistics between the three 
algorithms. Comparing copy number events observed in 
WaveCNV with CNVnator, we observed an overall congruency 
of 93% (95% in gains and 92% in losses). When comparing 
WaveCNV to OncoSNP-SEQ (a cancer-specific CNV caller), 
we see an overall congruency of 80% (87% for amplifications 
and 80% for deletions). The lower match for OncoSNP-SEQ is 
primarily due to our sample coverage being lower than that rec- 
ommended for accurate OncoSNP-SEQ performance (our 
sample was sequenced to 30 x, whereas OncoSNP-SEQ requires 
a minimum of 60 x). 

Our concordance with CNVnator is one of the highest 
reported so far between any two programs for sequencing data 
thereby supporting the vahdity of our algorithm. There are key 
additional features that are unique to our algorithm, which are 
critical for cancer genomes. WaveCNV successfully combines the 
read depth distribution, MAF and reference-based normalization 
of tumor with matched normal to estimate ploidy of the genome 
and corrects for mouse contamination with the additional bene- 
fits of copy number allele assignments and LOH detection. 
We have a well-defined mechanism to control for detectable 
event sizes at different levels of sequencing coverage and tumor 
sample cellularity. Also although WaveCNV can assign copy 
numbers to any segment within the genome, the primary focus 
of cancer research is on the somatic changes and somatic CNVs 
are identified in our output by integrating matched normal/ 
controls into the copy number analysis. 



4 DISCUSSION 

We have developed a computational algorithm to detect CNV 
boundaries from whole-genome sequencing data and assigned 
digitized copy number by modehng for sample- specific con- 
founding factor such as aneuploidy, normal/diploid contamin- 
ation of primary tumors and mouse contamination in xenograft 
models. The segmentation algorithm based on wavelet transform 
provides a unique opportunity to probe the genome in any 



Table 2. WaveCNV comparison to other algorithms 



Algorithm Events Gains Losses Total basepair Total basepair Congruency Congruency Congruency 

gains losses gains losses all 



WaveCNV 764 359 405 312922439 567442194 - - - 

CNVnator 3658 829 2829 319 703 400 622106100 0.95 0.92 0.93 

OncoSNP 1423 567 856 260783488 912819293 0.87 0.80 0.80 



Note: This table shows the base pair level congruency in copy number alterations called by WaveCNV compared with CNVnator and OncoSNP-SEQ. 
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spatial genomic scale. Although the first part of the algorithm 
identifies all discontinuities, the second part of WaveCNV pro- 
vides a statistical framework to assign CN and merge neighbor- 
ing events carrying the same copy number. This corrects for false 
shearing of copy number events that may arise due to poor qual- 
ity of sequencing data. 

A key component of WaveCNV is the matched-normal-based 
copy number correction. Being aware of the diploid control 
ensures that any systemic artifacts that may appear in both 
tumor and normal genomes, including platform- specific biases, 
unsequenceable regions and so forth, are effectively removed or 
corrected for. This resulted in a high concordance between 
somatic CN calls from our algorithm in sequencing data to 
both microarray data and qPCR. 

Xenograft models for many types of primary tumors have 
increasingly become useful tools to understand cancer biology 
and to test therapeutic targets. Our model estimates mouse 
contamination and the reported allele and copy number reflects 
the correction for mouse contamination. The mouse contamin- 
ation estimate matches well with our mouse- specific qPCR data. 
On the same note, most directly sequenced primary tumor 
samples contain stromal contamination, and our algorithm can 
quantify and model for the presence of contaminating diploid 
cells in that sequencing data. 

5 CONCLUSION 

Our segmentation algorithm is unique from its methodology per- 
spective, and can potentially improve the boundary assignments 
on the smaller CNV events found via whole-genome sequencing. 
In addition, the assignment of specific alleles to copy number 
losses/gains can give researchers the ability to explore relation- 
ships between selected sequence mutations and structural vari- 
ation. For example, in pancreatic cancer a KRAS activating 
point mutation is often coupled with duplication events, thus 
amplifying the effect of this oncogene. Being able to identify 
similar correlations based on reports from our algorithm could 
prove useful in prioritizing-specific genes for further study. 
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